Preamble

This document shows how to combine 3 data sets:

These three data sets provide information at the district level (roughly 700 districts in Vietnam). The difficulty is that the three data sets do not use exactly the same district definitions: as the population grows over time, districts tend to split. The problems we had to deal with here are

Packages

The needed packages:

library(sf)
library(stars)
library(osmdata)
library(magrittr)
library(stringr)
library(lubridate)
library(tidyr)
library(purrr)
library(dplyr) # best to load last
library(data.table)
library(ggplot2)

Defining a version of the hist() function without a main title:

hist2 <- function(...) hist(..., main = NA)

Population density raster data

Downloading the population density data from WorldPop:

setwd("~/Desktop/OUCRU/FacebookData/ColocationMarc")
download.file("ftp://ftp.worldpop.org.uk/GIS/Population/Individual_countries/VNM/Viet_Nam_100m_Population/VNM_ppp_v2b_2020_UNadj.tif",
              "VNM_ppp_v2b_2020_UNadj.tif")

Loading the data:

worldpop <- read_stars("VNM_ppp_v2b_2020_UNadj.tif")

Google and Apple mobility data

Downloading the population mobility data:

download.file("https://www.dropbox.com/s/6fl62gcuma9890f/google.rds?raw=1", "google.rds")
download.file("https://www.dropbox.com/s/uuxxjm3cgs0a4gw/apple.rds?raw=1", "apple.rds")

Loading the data:

google <- readRDS("google.rds")
apple <- readRDS("apple.rds") %>% 
  mutate_if(is.numeric, subtract, 100)
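Apple reports mobility as an index relative to a baseline of 100, so subtracting 100 re-centres the values on 0. The mutate_if() call can be checked on a toy tibble (hypothetical columns, made-up values):

```r
library(dplyr)
library(magrittr)

# Hypothetical Apple-style rows: numeric columns are indexed to a 100 baseline
apple_toy <- tibble(region  = c("Hanoi", "Da Nang"),
                    driving = c(100, 57.5),
                    walking = c(120, 80))

# Shift every numeric column so that the baseline becomes 0
apple_toy <- mutate_if(apple_toy, is.numeric, subtract, 100)
apple_toy
```

The character column region is left untouched; driving becomes 0 and -42.5.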

Polygons from GADM

Downloading the polygons from GADM:

download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_0_sf.rds", "gadm36_VNM_0_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_1_sf.rds", "gadm36_VNM_1_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_2_sf.rds", "gadm36_VNM_2_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm2.8/rds/VNM_adm2.rds"       , "VNM_adm2.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm2.8/rds/VNM_adm3.rds"       , "VNM_adm3.rds")

Loading the polygons:

vn0 <- readRDS("gadm36_VNM_0_sf.rds")     # country polygon
vn1 <- readRDS("gadm36_VNM_1_sf.rds")     # provinces polygons

vn2 <- readRDS("gadm36_VNM_2_sf.rds") %>% # districts polygons
  transmute(province = str_squish(NAME_1),
            district = str_squish(NAME_2) %>%
              str_remove_all("Thành Phố |Thị Xã |Quận "))

vn2_old <- readRDS("VNM_adm2.rds") %>%    # old version of the districts polygons
  st_as_sf() %>% 
  transmute(province = str_squish(NAME_1),
            district = str_squish(NAME_2) %>% 
              str_remove_all("Thành Phố |Thị Xã |Quận ") %>% 
              str_replace("Chiêm Hoá", "Chiêm Hóa"))

vn3_old <- readRDS("VNM_adm3.rds") %>%    # old version of the communes polygons
  st_as_sf() %>% 
  transmute(province = str_squish(NAME_1),
            district = str_squish(NAME_2) %>% 
              str_remove_all("Thành Phố |Thị Xã |Quận ") %>% 
              str_replace("Chiêm Hoá", "Chiêm Hóa"),
            commune = str_squish(NAME_3)) %>% 
  arrange(province, district, commune)

Removing the commune Huæi Luông from the district Sìn Hồ:

vn2_old %<>% 
  filter(district == "Sìn Hồ") %>% 
  st_set_geometry(st_union(filter(vn3_old, district == "Sìn Hồ", commune != "Huæi Luông"))) %>% 
  rbind(filter(vn2_old, district != "Sìn Hồ")) %>% 
  arrange(province, district)

Adding the polygon for Côn Đảo from OpenStreetMap

Downloading the administrative boundaries from OpenStreetMap:

con_dao <- c(106.523384, 8.622214, 106.748218, 8.782639) %>%
  opq() %>% 
  add_osm_feature(key = "boundary", value = "administrative") %>% 
  osmdata_sf()

The main island is made of lines. Let’s transform them into a polygon:

main_island <- con_dao$osm_lines %>%
  st_combine() %>%
  st_polygonize() %>%
  st_geometry() %>%
  st_collection_extract("POLYGON")
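The behaviour of st_polygonize() can be checked on a toy ring made of two line segments (a sketch on made-up coordinates, not the actual OSM data):

```r
library(sf)
library(magrittr)

# Two linestrings that together trace a closed square ring
l1 <- st_linestring(rbind(c(0, 0), c(1, 0), c(1, 1)))
l2 <- st_linestring(rbind(c(1, 1), c(0, 1), c(0, 0)))

square <- st_sfc(l1, l2) %>%
  st_combine() %>%                      # one MULTILINESTRING
  st_polygonize() %>%                   # a GEOMETRYCOLLECTION of polygons
  st_collection_extract("POLYGON")

st_area(square)  # 1, the area of the unit square
```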

The other islands are already polygons. Let’s extract and merge them with the newly created polygon of the main island:

con_dao <- con_dao$osm_polygons %>%
  st_geometry() %>% 
  c(main_island) %>% 
  st_multipolygon() %>% 
  st_sfc(crs = st_crs(vn2))

Here are the polygons for Côn Đảo:

con_dao %>% 
  st_geometry() %>% 
  plot(col = "grey")

Let’s add the polygon of Côn Đảo to the GADM data:

vn2 %<>%
  filter(province == "Bà Rịa - Vũng Tàu") %>% 
  head(1) %>% 
  mutate(district = "Côn Đảo") %>% 
  st_set_geometry(con_dao) %>% 
  rbind(vn2)

Adding the 2019 census data

For some reason the province of Vĩnh Long is missing, so we add its information from Wikipedia:

census <- census_path %>%
  readRDS() %>% 
  group_by(province, district) %>% 
  summarise(n = sum(n)) %>% 
  ungroup() %>% 
  mutate(province = str_squish(province) %>%
                    str_remove_all("Thành phố |Tỉnh ") %>% 
                    str_replace("Đăk Lăk"             , "Đắk Lắk") %>% 
                    str_replace("Đăk Nông"            , "Đắk Nông") %>% 
                    str_replace("Khánh Hoà"           , "Khánh Hòa") %>% 
                    str_replace("Thanh Hoá"           , "Thanh Hóa"),
         district = str_squish(district) %>%
                    str_replace("Thành phố Cao Lãnh"  , "Cao Lãnh (Thành phố)") %>% 
                    str_replace("Thị xã Hồng Ngự"     , "Hồng Ngự (Thị xã)") %>% 
                    str_replace("Thị xã Kỳ Anh"       , "Kỳ Anh (Thị xã)") %>% 
                    str_replace("Thị xã Long Mỹ"      , "Long Mỹ (Thị xã)") %>% 
                    str_replace("Thị xã Cai Lậy"      , "Cai Lậy (Thị xã)") %>% 
                    str_replace("Thị xã Duyên Hải"    , "Duyên Hải (Thị xã)") %>% 
                    str_remove_all("Huyện |Huỵên |Quận |Thành phố |Thành Phô |Thành Phố |Thị xã |Thị Xã ") %>% 
                    str_replace("Hoà Vang"            , "Hòa Vang") %>% 
                    str_replace("Ứng Hoà"             , "Ứng Hòa") %>% 
                    str_replace("Đồng Phù"            , "Đồng Phú") %>% 
                    str_replace("Đắk Glong"           , "Đăk Glong") %>% 
                    str_replace("Ia H’Drai"           , "Ia H' Drai") %>% 
                    str_replace("ý Yên"               , "Ý Yên") %>% 
                    str_replace("Bác ái"              , "Bác Ái") %>% 
                    str_replace("Phan Rang- Tháp Chàm", "Phan Rang-Tháp Chàm") %>% 
                    str_replace("Đông Hoà"            , "Đông Hòa") %>% 
                    str_replace("Tuy Hòa"             , "Tuy Hoà") %>% 
                    str_replace("Thiệu Hoá"           , "Thiệu Hóa")) %>% 
  bind_rows(
    bind_cols(
      data.frame(province = rep("Vĩnh Long", 8)),
      tribble(
        ~district  ,     ~n,
        "Bình Tân" ,  93758, # 2009
        "Long Hồ"  , 160537, # 2018
        "Mang Thít", 103573, # 2018
        "Tam Bình" , 162191, # 2003
        "Trà Ôn"   , 149983, # 2003
        "Vũng Liêm", 176233, # 2003
        "Bình Minh",  95282, # 2003
        "Vĩnh Long", 200120  # 2018
      )
    ),
    tribble(
      ~province  , ~district     , ~n, 
      "Điện Biên", "Mường Ảng"   ,  37077, # 2006
      "Hải Phòng", "Bạch Long Vĩ",    912, # 2018
      "Phú Thọ"  , "Thanh Sơn"   , 187700, # 2003
      "Quảng Trị", "Cồn Cỏ"      ,    400  # 2003
    )
  )
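The normalization pattern used above (squish the whitespace, strip the administrative prefix, harmonise diacritic variants) can be seen in isolation on a few made-up entries:

```r
library(stringr)
library(magrittr)

x <- c("  Huyện   Hoà Vang ", "Thành phố Cao Lãnh", "Quận Ba Đình")

res <- x %>%
  str_squish() %>%                                              # collapse whitespace
  str_replace("Thành phố Cao Lãnh", "Cao Lãnh (Thành phố)") %>% # disambiguate before stripping
  str_remove_all("Huyện |Quận |Thành phố ") %>%                 # strip administrative prefixes
  str_replace("Hoà Vang", "Hòa Vang")                           # harmonise a diacritic variant

res
```

Note that the disambiguation has to happen before the prefixes are stripped, otherwise "Thành phố Cao Lãnh" and the district "Cao Lãnh" would collapse into the same name.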

Let’s check that the names of the provinces in the GADM and census data sets are the same:

identical(sort(unique(census$province)), sort(unique(vn2$province)))
## [1] TRUE

Let’s check that the districts in the GADM and the census data are the same:

nrow(anti_join(census, vn2, c("province", "district"))) < 1
## [1] TRUE
nrow(anti_join(vn2, census, c("province", "district"))) < 1
## [1] TRUE
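anti_join() returns the rows of its first argument that have no match in the second, so an empty result in both directions means the two district lists agree exactly. A minimal illustration on toy tibbles:

```r
library(dplyr)

x <- tibble(province = c("A", "A", "B"), district = c("d1", "d2", "d3"))
y <- tibble(province = c("A", "B"),      district = c("d1", "d3"))

leftovers <- anti_join(x, y, by = c("province", "district"))
leftovers                                               # only the A / d2 row
nrow(anti_join(y, x, by = c("province", "district")))   # 0: every row of y matches x
```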

Let’s merge the census and GADM data sets:

vn2 %<>% left_join(census, c("province", "district"))

Let’s check that no district is duplicated:

# If true, then no district is duplicated
vn2 %>%
  st_drop_geometry() %>%
  as.data.table() %>%
  setkey(province, district) %>%
  duplicated() %>%
  sum() %>%
  is_less_than(1)
## [1] TRUE

Merging some districts

The colocation data use the Bing polygons, which do not seem to be very up to date. To adjust for that, we need to merge the following districts in the GADM data:

Furthermore, some districts need to be split in two, with each part merged into a different district. That’s the case for these 3 districts:

As illustrated below:

plot_districts <- function(d1, d2, d3) {
  tmp <- vn2 %>% 
    filter(district %in% c(d1, d2, d3)) %>% 
    st_geometry()
  
  plot(tmp)
  plot(worldpop, add = TRUE, main = NA)
  plot(tmp, add = TRUE, col = adjustcolor("green", .2))
  
  vn2 %>% 
    filter(district == d1) %>% 
    st_geometry() %>% 
    plot(add = TRUE, col = adjustcolor("red", .2))
  
  vn2_old %>% 
    filter(district %in% c(d2, d3)) %>% 
    st_geometry() %>% 
    plot(add = TRUE, border = "blue", lwd = 2)
}
plot_districts("Nậm Pồ"  , "Mường Chà", "Mường Nhé")

plot_districts("Lâm Bình", "Nà Hang"  , "Chiêm Hóa")

plot_districts("Nậm Nhùn", "Mường Tè" , "Sìn Hồ")

The raster data show the WorldPop population density that we’ll use to split the population of the red district between the two blue districts. Two functions fix the two problems mentioned above. Here is the first one:

merge_districts <- function(vn, d1, d2, p) {
  dst <- c(d1, d2)
  
  tmp <- vn %>% 
    filter(district %in% dst, province == p) %>% 
    mutate(n = sum(n))
  
  tmp %<>% 
    filter(district %in% d1) %>% 
    st_set_geometry(st_union(tmp))
  
  vn %>% 
    filter(! (district %in% dst & province == p)) %>% 
    rbind(tmp) %>% 
    arrange(province, district)
}
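Geometry aside, the population bookkeeping of merge_districts() reduces to pooling n over the two districts and keeping a single row. A plain-tibble sketch with made-up numbers:

```r
library(dplyr)

toy <- tibble(province = "Hà Tĩnh",
              district = c("Kỳ Anh", "Kỳ Anh (Thị xã)", "Cẩm Xuyên"),
              n        = c(100000, 20000, 50000))

dst <- c("Kỳ Anh", "Kỳ Anh (Thị xã)")

merged <- toy %>%
  filter(district %in% dst) %>%
  mutate(n = sum(n)) %>%           # both rows receive the pooled population
  filter(district == dst[1])       # keep one row, named after the first district

toy2 <- toy %>%
  filter(! district %in% dst) %>%
  bind_rows(merged) %>%
  arrange(province, district)

toy2
```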

The second function relies on the following helper:

proportion <- function(to_cut, one_district, new_vn = vn2, old_vn = vn2_old, rstr = worldpop) {
  to_cut <- filter(new_vn, district == to_cut)
  one_part <- st_intersection(to_cut, filter(old_vn, district == one_district))
  
  wp0 <- rstr %>%
    st_crop(to_cut) %>% 
    st_as_stars() %>% 
    unlist() %>% 
    sum(na.rm = TRUE)
  
  rstr %>% 
    st_crop(one_part) %>% 
    st_as_stars() %>% 
    unlist() %>% 
    sum(na.rm = TRUE) %>% 
    divide_by(wp0)
}

Let’s try it:

proportion("Nậm Nhùn", "Mường Tè" , vn2, vn2_old, worldpop)
## [1] 0.6717576
proportion("Nậm Pồ"  , "Mường Chà", vn2, vn2_old, worldpop)
## [1] 0.4612189
proportion("Lâm Bình", "Nà Hang"  , vn2, vn2_old, worldpop)
## [1] 0.7201902

And here is the second function we needed:

merge_back_districts <- function(c2, d1, d2, d3, c1 = vn2_old, rstr = worldpop) {
  dsts <- c(d1, d2, d3)
  
  tmp <- c2 %>% 
    filter(district %in% dsts) %$%
    setNames(n, district)

  half1 <- round(proportion(d1, d2, c2, c1, rstr) * tmp[d1])

  half2 <- tmp[d1] - half1
  tmp[d2] <- tmp[d2] + half1
  tmp[d3] <- tmp[d3] + half2
  tmp <- tmp[dsts[-1]]
  
  c1 %>% 
    filter(district %in% dsts[-1]) %>% 
    mutate(n = tmp) %>% 
    select(everything(), geometry) %>% 
    rbind(filter(c2, ! district %in% dsts)) %>% 
    arrange(province, district)
}
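The splitting arithmetic of merge_back_districts() is easy to check on made-up numbers: the population of d1 is divided between d2 and d3 according to a proportion p, and the total population is preserved:

```r
# Toy figures, not the real census counts; p is a hypothetical proportion
n <- c("Nậm Pồ" = 50000, "Mường Chà" = 40000, "Mường Nhé" = 30000)
p <- 0.46  # share of Nậm Pồ's population falling on the Mường Chà side

half1 <- round(p * n["Nậm Pồ"])   # 23000 goes to Mường Chà
half2 <- n["Nậm Pồ"] - half1      # the remaining 27000 goes to Mường Nhé

n["Mường Chà"] <- n["Mường Chà"] + half1
n["Mường Nhé"] <- n["Mường Nhé"] + half2
n <- n[c("Mường Chà", "Mường Nhé")]

sum(n)  # 120000: the total population is unchanged
```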

Let’s now call these two functions to perform the merges:

vn2 %<>%
  merge_districts("Kỳ Anh"     , "Kỳ Anh (Thị xã)"   , "Hà Tĩnh") %>% 
  merge_districts("Long Mỹ"    , "Long Mỹ (Thị xã)"  , "Hậu Giang") %>% 
  merge_districts("Cai Lậy"    , "Cai Lậy (Thị xã)"  , "Tiền Giang") %>% 
  merge_districts("Duyên Hải"  , "Duyên Hải (Thị xã)", "Trà Vinh") %>% 
  merge_districts("Tân Uyên"   , "Bắc Tân Uyên"      , "Bình Dương") %>%
  merge_districts("Bến Cát"    , "Bàu Bàng"          , "Bình Dương") %>% 
  merge_districts("Bắc Từ Liêm", "Nam Từ Liêm"       , "Hà Nội") %>% # then rename to Từ Liêm
  merge_districts("Mộc Hóa"    , "Kiến Tường"        , "Long An") %>% 
  merge_districts("Quỳnh Lưu"  , "Hoàng Mai"         , "Nghệ An") %>% 
  merge_districts("Quảng Trạch", "Ba Đồn"            , "Quảng Bình") %>% 
  merge_back_districts("Nậm Pồ"  , "Mường Chà", "Mường Nhé") %>% 
  merge_back_districts("Nậm Nhùn", "Mường Tè" , "Sìn Hồ") %>% 
  merge_back_districts("Lâm Bình", "Nà Hang"  , "Chiêm Hóa") %>% 
  mutate(district = str_replace(district, "Bắc Từ Liêm", "Từ Liêm")) # here the renaming

Let’s calculate and add the areas:

vn2 %<>% 
  mutate(area_km2 = vn2 %>%
           st_geometry() %>% 
           st_area() %>% 
           as.numeric() %>% 
           divide_by(1e6)) %>% 
  select(everything(), geometry)
head(vn2)
## Simple feature collection with 6 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 105.0233 ymin: 10.30302 xmax: 105.5753 ymax: 10.96207
## CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
##   province   district      n area_km2                       geometry
## 1 An Giang     An Phú 183074 225.1253 MULTIPOLYGON (((105.1216 10...
## 2 An Giang   Châu Đốc 117244 104.0911 MULTIPOLYGON (((105.1387 10...
## 3 An Giang   Châu Phú 255894 450.9987 MULTIPOLYGON (((105.327 10....
## 4 An Giang Châu Thành 176945 354.9705 MULTIPOLYGON (((105.3082 10...
## 5 An Giang    Chợ Mới 353705 369.1722 MULTIPOLYGON (((105.4729 10...
## 6 An Giang Long Xuyên 303381 114.2415 MULTIPOLYGON (((105.4729 10...

Let’s calculate and add the population densities:

vn2 %<>%
  mutate(den_km2 = n / area_km2) %>%
  select(everything(), geometry)
head(vn2)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 105.0233 ymin: 10.30302 xmax: 105.5753 ymax: 10.96207
## CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
##   province   district      n area_km2   den_km2                       geometry
## 1 An Giang     An Phú 183074 225.1253  813.2095 MULTIPOLYGON (((105.1216 10...
## 2 An Giang   Châu Đốc 117244 104.0911 1126.3596 MULTIPOLYGON (((105.1387 10...
## 3 An Giang   Châu Phú 255894 450.9987  567.3941 MULTIPOLYGON (((105.327 10....
## 4 An Giang Châu Thành 176945 354.9705  498.4781 MULTIPOLYGON (((105.3082 10...
## 5 An Giang    Chợ Mới 353705 369.1722  958.1031 MULTIPOLYGON (((105.4729 10...
## 6 An Giang Long Xuyên 303381 114.2415 2655.6104 MULTIPOLYGON (((105.4729 10...

Visualizations of the GADM / census data

Population sizes

The distribution of the districts’ population sizes:

hist2(vn2$n, n = 50, xlab = "population size", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1e6, 2e5), format(seq(0, 10, 2) * 1e5, big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2)

Let’s define a palette of colors:

cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)

The distribution of the districts’ population sizes where all the bars are of the same area and represent one decile of the data:

hist2(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = "population size", ylab = "density of probability")
axis(1, seq(0, 1e6, 2e5), format(seq(0, 10, 2) * 1e5, big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2)

quantile(vn2$n, seq(0, 1, le = 11))
##        0%       10%       20%       30%       40%       50%       60%       70% 
##     400.0   44849.5   64843.2   85749.6  104927.4  122217.0  144226.2  163635.0 
##       80%       90%      100% 
##  195393.4  249490.8 1035938.0
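Using the deciles as break points guarantees that each bar covers exactly 10% of the observations, whatever the shape of the distribution. A quick check on simulated log-normal data:

```r
set.seed(1)
x <- rlnorm(1000)

brks <- quantile(x, seq(0, 1, length.out = 11))  # deciles as break points
h <- hist(x, breaks = brks, plot = FALSE)

h$counts  # ten bins of 100 observations each
```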

Let’s map the population sizes of the districts:

vn2 %>% 
  st_geometry() %>% 
  plot(lwd = .1, col = pal[cut(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)))], main = NA)

vn0 %>% 
  st_geometry() %>% 
  plot(add = TRUE)

The mean and median of the district population sizes:

mean(vn2$n)
## [1] 139978.6
median(vn2$n)
## [1] 122217

Areas

The distribution of the districts’ areas:

hist2(vn2$area_km2, n = 50,
      xlab = expression(paste("area ", (km^2))), ylab = "number of districts")

The mean and median of the districts’ areas:

mean(vn2$area_km2)
## [1] 471.8696
median(vn2$area_km2)
## [1] 339.3427

Densities

Some quantiles of the districts’ densities:

(quants <- quantile(vn2$den_km2, c(.025, .25, .5, .75, .975)))
##        2.5%         25%         50%         75%       97.5% 
##    34.62209   115.50101   397.09968  1019.86255 20341.46013

The distribution of the districts’ densities, on a log scale, with quantiles:

hist2(log10(as.numeric(vn2$den_km2)), n = 50, axes = FALSE,
      xlab = expression(paste("density (/", km^2, ")")), ylab = "number of districts")
axis(1, 1:4, c("10", "100", "1000", "10000"))
axis(2)
abline(v = log10(quants), lty = c(3, 2, 1, 2, 3), col = "blue", lwd = 2)

Same distribution as above where all the bars have the same area representing 10% of the data:

xs <- log10(vn2$den_km2)
hist2(xs, quantile(xs, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = expression(paste("density (/", km^2, ")")), ylab = "density of probability")
axis(1, 1:4, c("10", "100", "1000", "10000"))
axis(2)

Mapping the districts’ population densities:

vn2 %>% 
  st_geometry() %>% 
  plot(lwd = .1, col = pal[cut(vn2$den_km2, quantile(vn2$den_km2, seq(0, 1, le = 11)))], main = NA)

vn0 %>% 
  st_geometry() %>% 
  plot(add = TRUE)

The relationship between the population size and the population density:

vn2 %$%
  plot(log10(n), log10(den_km2), col = "blue", axes = FALSE, xlab = "population size",
       ylab = expression(paste("population density (/", km^2, ")")))

axis(1, 3:6, format(10^(3:6), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10^(1:4), big.mark = ",", scientific = FALSE, trim = TRUE))

meaning that density increases with some power of the population size.
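Indeed, if den ∝ n^k, then log10(den) is linear in log10(n) with slope k, which a regression recovers. A toy check on simulated data (made-up exponent and noise level):

```r
set.seed(42)
n_sim   <- 10^runif(200, 4, 6)                        # population sizes
den_sim <- 1e-3 * n_sim^1.5 * 10^rnorm(200, sd = .1)  # density ∝ n^1.5, with noise

slope <- coef(lm(log10(den_sim) ~ log10(n_sim)))[["log10(n_sim)"]]
slope  # close to the true exponent of 1.5
```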

Colocation data

The list of colocation data files:

files <- dir(colocation_path)

There are 17 of them:

length(files)
## [1] 17

Making names from the file names:

weeks <- str_remove_all(files, "^.*__|.csv")

Loading the colocation data into a list (one slot per week):

colocation <- paste0(colocation_path, dir(colocation_path)) %>%
  map(readr::read_csv) %>%
  setNames(weeks) %>% 
  map(select, -country, -ds) # remove the country code (useless) and the ds (which is the name of the slot)

The colocation data looks like this:

head(colocation, 1)
## $`2020-03-03`
## # A tibble: 448,665 x 13
##    polygon1_id polygon1_name lon_1 lat_1 name_stack_1 fb_population_1
##          <dbl> <chr>         <dbl> <dbl> <chr>                  <dbl>
##  1      834298 Huyện Lý Sơn   109.  15.4 Quảng Ngãi …             156
##  2      834671 Huyện Đạ Tẻh   108.  11.6 Lâm Đồng Pr…             123
##  3      834269 Huyện Hương …  107.  16.4 Thừa Thiên …             850
##  4      834290 Huyện Hiệp Đ…  108.  15.5 Quảng Nam P…             206
##  5      834320 Thị Xã Bình …  107.  11.7 Bình Phước …             991
##  6      834672 Huyện Cát Ti…  107.  11.7 Lâm Đồng Pr…             109
##  7      834652 Huyện Tuy Ph…  109.  11.4 Bình Thuận …            1555
##  8      834798 Huyện Đan Ph…  106.  21.1 Hanoi City …            2640
##  9      834599 Huyện Đông H…  106.  20.5 Thái Bình P…            1819
## 10      834336 Huyện Vĩnh H…  106.  10.9 Long An Pro…             422
## # … with 448,655 more rows, and 7 more variables: polygon2_id <dbl>,
## #   polygon2_name <chr>, lon_2 <dbl>, lat_2 <dbl>, name_stack_2 <chr>,
## #   fb_population_2 <dbl>, link_value <dbl>

The slot names are the last day of the 7-day period over which the data are collected:

names(colocation)
##  [1] "2020-03-03" "2020-03-10" "2020-03-17" "2020-03-24" "2020-03-31"
##  [6] "2020-04-07" "2020-04-14" "2020-04-21" "2020-04-28" "2020-05-05"
## [11] "2020-05-12" "2020-05-19" "2020-05-26" "2020-06-02" "2020-06-09"
## [16] "2020-06-16" "2020-06-23"
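So the first day of each collection window is six days before the slot name; with lubridate (already loaded above):

```r
library(lubridate)

end_day <- ymd("2020-03-03")   # a slot name: the last day of the window
end_day - 6                    # "2020-02-26", the first day of the 7-day window
```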

Getting rid of whatever is not linked to Vietnam

The problem

Here we show what the problem is: some of the data are located outside Vietnam. The following function plots the colocation data:

plot_fb <- function(df, xlim, ylim, col) {
  plot(xlim, ylim, asp = 1, xlab = "longitude", ylab = "latitude", type = "n")
  maps::map(col = "grey", fill = TRUE, add = TRUE)
  points(df[[1]], df[[2]], pch = ".", col = col)
  axis(1)
  axis(2)
  box(bty = "o")
}

Let’s consider one week:

june23 <- colocation$`2020-06-23`

The locations in this week are within these boundaries:

(xlim <- range(range(june23$lon_1), range(june23$lon_2)))
## [1] 101.5709 109.3583
(ylim <- range(range(june23$lat_1), range(june23$lat_2)))
## [1]  8.668113 23.237631

Let’s plot the polygon-1 locations:

june23 %>%
  select(lon_1, lat_1) %>% 
  distinct() %>% 
  plot_fb(xlim, ylim, "blue")

And the polygon-2 locations:

june23 %>%
  select(lon_2, lat_2) %>% 
  distinct() %>% 
  plot_fb(xlim, ylim, "red")

The solution

In the following function, df is a data frame with the same column names as a “colocation map” data frame, pl is a non-projected sf polygon, and type is either 1 or 2.

pts_in_pol <- function(type, df, pl, project = FALSE) {
  # assumes that sf is not projected.
  # 4326: non-projected
  # 3857: pseudo-Mercator (e.g. Google Maps)
  df %<>% st_as_sf(coords = paste0(c("lon_", "lat_"), type), crs = 4326)
  if (project) {
    df %<>% st_transform(3857)
    pl %<>% st_transform(3857)
  }
  df %>%
    st_intersects(pl) %>% 
    map_int(length)
}

The function returns a vector with one element per row of df: 1 if the point is inside the polygon pl and 0 otherwise. Let’s try it:

tmp <- pts_in_pol(1, june23, vn0)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
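The same logic can be checked on toy data: a unit-square polygon and two points, one inside and one outside (a sketch, not the real Vietnam polygon):

```r
library(sf)
library(purrr)

# A unit-square polygon in lon/lat coordinates
square <- st_sfc(st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
                 crs = 4326)

# One point inside the square, one outside
pts <- st_as_sf(data.frame(lon_1 = c(.5, 2), lat_1 = c(.5, 2)),
                coords = c("lon_1", "lat_1"), crs = 4326)

inside_flags <- map_int(st_intersects(pts, square), length)
inside_flags  # 1 0
```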

It takes about 3.5 minutes, and about 3 times longer if we project the points and the polygon. The following function takes the same arguments as the previous one, which it uses to delete the records of df whose start and end coordinates are both outside pl:

rcd_in_pol <- function(df, pl, project = FALSE) {
  require(magrittr)
  1:2 %>%
    parallel::mclapply(pts_in_pol, df, pl, project, mc.cores = 2) %>% 
    as.data.frame() %>%
    rowSums() %>% 
    is_greater_than(0) %>% 
    magrittr::extract(df, ., ) # there is an extract() function in tidyr too
}

Let’s try it:

june23 %<>% rcd_in_pol(vn0)

Let’s process all the data (takes about 1 minute):

colocation %<>% map(rcd_in_pol, vn0)
colocation <- readRDS("colocation2.rds") # load a previously saved result (overwrites the line above)

Working out a district ID common to GADM + census and colocation data

The list of districts in the colocation data:

col_names <- c("polygon_id", "polygon_name", "lon", "lat", "name_stack")

districts1 <- map_dfr(colocation, select, polygon1_id, polygon1_name, lon_1, lat_1, name_stack_1) %>% 
  setNames(col_names)

districts2 <- map_dfr(colocation, select, polygon2_id, polygon2_name, lon_2, lat_2, name_stack_2) %>% 
  setNames(col_names)

districts <- bind_rows(districts1, districts2) %>%
  distinct()

This is what it looks like:

districts
## # A tibble: 693 x 5
##    polygon_id polygon_name       lon   lat name_stack                           
##         <dbl> <chr>            <dbl> <dbl> <chr>                                
##  1     834298 Huyện Lý Sơn      109.  15.4 Quảng Ngãi Province // Huyện Lý Sơn  
##  2     834671 Huyện Đạ Tẻh      108.  11.6 Lâm Đồng Province // Huyện Đạ Tẻh    
##  3     834269 Huyện Hương Trà   107.  16.4 Thừa Thiên Huế Province // Huyện Hươ…
##  4     834290 Huyện Hiệp Đức    108.  15.5 Quảng Nam Province // Huyện Hiệp Đức 
##  5     834320 Thị Xã Bình Long  107.  11.7 Bình Phước Province // Thị Xã Bình L…
##  6     834672 Huyện Cát Tiên    107.  11.7 Lâm Đồng Province // Huyện Cát Tiên  
##  7     834652 Huyện Tuy Phong   109.  11.4 Bình Thuận Province // Huyện Tuy Pho…
##  8     834798 Huyện Đan Phượng  106.  21.1 Hanoi City // Huyện Đan Phượng       
##  9     834599 Huyện Đông Hưng   106.  20.5 Thái Bình Province // Huyện Đông Hưng
## 10     834336 Huyện Vĩnh Hưng   106.  10.9 Long An Province // Huyện Vĩnh Hưng  
## # … with 683 more rows
saveRDS(districts, file="coloc_district.RDS")

Province name missing for some districts of Hanoi

In the colocation data, the name_stack_* variables contain the names of the province and the district separated by //. The problem is that a number of districts do not have // in their name_stack variable, and all of them seem to be in the province of Hanoi:

plot(st_geometry(vn0), col = "grey")

vn1 %>%
  filter(NAME_1 == "Hà Nội") %>%
  st_geometry() %>%
  plot(add = TRUE, col = "yellow")

districts %>% 
  filter(! grepl(" // ", name_stack)) %>% 
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>% 
  st_geometry() %>% 
  plot(add = TRUE, col = "red")

Separating province name from district name in the colocation data

Plus a number of name fixes:

districts %<>% 
  separate(name_stack, c("province", "district"), " // ") %>% 
  mutate(indicate = is.na(district),
         district = ifelse(indicate, province, district),
         province = ifelse(indicate, "Hanoi City", province) %>% 
           str_squish() %>% 
           str_remove(" Province| City") %>% 
           str_replace("-", " - ") %>% 
           str_replace("Da Nang"     , "Đà Nẵng") %>%
           str_replace("Hanoi"       , "Hà Nội") %>% 
           str_replace("Hai Phong"   , "Hải Phòng") %>%
           str_replace("Ho Chi Minh" , "Hồ Chí Minh") %>%
           str_replace("Hòa Bình"    , "Hoà Bình"),
         polygon_name = str_squish(polygon_name) %>% 
           str_replace("Thành Phố Cao Lãnh", "Cao Lãnh (Thành phố)") %>% 
           str_replace("Thị Xã Hồng Ngự", "Hồng Ngự (Thị xã)") %>% 
           str_remove("Huyện |Thành phố |Thị xã |Quận |Thành Phố |Thị Xã ") %>%
           str_replace("Quy Nhơn"    , "Qui Nhơn") %>% 
           str_replace("Đảo Phú Quý" , "Phú Quí") %>% 
           str_replace("Bình Thủy"   , "Bình Thuỷ") %>% 
           str_replace("Hòa An"      , "Hoà An") %>% 
           str_replace("Phục Hòa"    , "Phục Hoà") %>% 
           str_replace("Thái Hòa"    , "Thái Hoà") %>% 
           str_replace("Hạ Hòa"      , "Hạ Hoà") %>% 
           str_replace("Phú Hòa"     , "Phú Hoà") %>% 
           str_replace("Tây Hòa"     , "Tây Hoà") %>% 
           str_replace("Tuy Hòa"     , "Tuy Hoà") %>% 
           str_replace("Krông Ana"   , "Krông A Na") %>% 
           str_replace("Krông A Na"  , "Krông A Na") %>% ##
           str_replace("Krông Păk"   , "Krông Pắc") %>% 
           str_replace("Krông Pắc"   , "Krông Pắc") %>% ##
           str_replace("Đắk Glong"   , "Đăk Glong") %>% 
           str_replace("Đắk Rlấp"    , "Đắk R'Lấp") %>% 
           str_replace("A Yun Pa"    , "Ayun Pa") %>% 
           str_replace("Từ Liêm"     , "Nam Từ Liêm") %>% 
           str_replace("Kiến Thụy"   , "Kiến Thuỵ") %>% 
           str_replace("Thủy Nguyên" , "Thuỷ Nguyên") %>% 
           str_replace("Vị Thủy"     , "Vị Thuỷ") %>% 
           str_replace("Bác Ai"      , "Bác Ái") %>% 
           str_replace("Thanh Thủy"  , "Thanh Thuỷ") %>% 
           str_replace("Yên Hưng"    , "Quảng Yên") %>% 
           str_replace("Na Hang"     , "Nà Hang") %>% 
           str_replace("Mù Cang Chải", "Mù Căng Chải") %>%
           str_replace("M`Đrắk"      , "M'Đrắk") %>% 
           str_replace("Cư M`Gar"    , "Cư M'gar") %>% 
           str_replace("Ea H`Leo"    , "Ea H'leo") %>% 
           str_replace("Nam Từ Liêm" , "Từ Liêm") %>% 
           str_replace("Buôn Hồ"     , "Buôn Hồ"),
         polygon_name = ifelse(province == "Bạc Liêu" & polygon_name == "Hòa Bình",
                               "Hoà Bình", polygon_name)) %>% 
  select(-indicate, -district) %>% 
  rename(district = polygon_name)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [82, 210,
## 309, 593].
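separate()’s behaviour on rows lacking the separator can be reproduced on a toy data frame: missing pieces are filled with NA, which is exactly what the indicate flag above detects:

```r
library(tidyr)

toy <- data.frame(name_stack = c("Quảng Ngãi Province // Huyện Lý Sơn",
                                 "Quận Ba Đình"))

res <- suppressWarnings(separate(toy, name_stack, c("province", "district"), " // "))
res
# For the second row, province gets the whole string and district is NA
```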

The warning is produced because the province names are missing for some of the districts of Hanoi. Some districts do not have any information in the colocation data:

anti_join(districts, vn2, c("province", "district"))
## # A tibble: 0 x 5
## # … with 5 variables: polygon_id <dbl>, district <chr>, lon <dbl>, lat <dbl>,
## #   province <chr>
anti_join(vn2, districts, c("province", "district"))
## Simple feature collection with 5 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 104.636 ymin: 11.60531 xmax: 107.7464 ymax: 21.04582
## CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
##     province     district     n    area_km2   den_km2
## 1 Bình Phước    Phú Riềng 97931  667.330016 146.75048
## 2  Hải Phòng Bạch Long Vĩ   912   15.273999  59.70931
## 3    Kon Tum   Ia H' Drai  6638 1159.609633   5.72434
## 4  Quảng Trị       Cồn Cỏ   400    2.176718 183.76293
## 5     Sơn La       Vân Hồ 61197  969.335926  63.13291
##                         geometry
## 1 MULTIPOLYGON (((106.8315 11...
## 2 MULTIPOLYGON (((107.7457 20...
## 3 MULTIPOLYGON (((107.4588 13...
## 4 MULTIPOLYGON (((107.343 17....
## 5 MULTIPOLYGON (((105.0134 20...

Let’s map these districts that never have any data:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

tmp <- vn2 %>% 
  mutate(pd = paste(province, district)) %>% 
  filter(! pd %in% with(districts, paste(province, district)))

tmp %>% 
  st_geometry() %>% 
  plot(add = TRUE, col = "red")

tmp %>%
  st_centroid() %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  filter(between(Y, 15, 20.5)) %>%
  st_as_sf(coords = c("X", "Y")) %>% 
  plot(add = TRUE, col = "red")

Let’s add these 5 districts to districts:

tmp <- vn2 %>% 
  mutate(pd = paste(province, district)) %>% 
  filter(! pd %in% with(districts, paste(province, district)))

districts <- tmp %>%
  st_centroid() %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  transmute(lon = X, lat = Y) %>% 
  bind_cols(tmp) %>% 
  mutate(polygon_id = 850001:850005) %>% 
  select(polygon_id, district, lon, lat, province) %>% 
  bind_rows(districts)

This is what it looks like now:

districts
## # A tibble: 698 x 5
##    polygon_id district       lon   lat province      
##         <dbl> <chr>        <dbl> <dbl> <chr>         
##  1     850001 Phú Riềng     107.  11.7 Bình Phước    
##  2     850002 Bạch Long Vĩ  108.  20.1 Hải Phòng     
##  3     850003 Ia H' Drai    108.  14.2 Kon Tum       
##  4     850004 Cồn Cỏ        107.  17.2 Quảng Trị     
##  5     850005 Vân Hồ        105.  20.8 Sơn La        
##  6     834298 Lý Sơn        109.  15.4 Quảng Ngãi    
##  7     834671 Đạ Tẻh        108.  11.6 Lâm Đồng      
##  8     834269 Hương Trà     107.  16.4 Thừa Thiên Huế
##  9     834290 Hiệp Đức      108.  15.5 Quảng Nam     
## 10     834320 Bình Long     107.  11.7 Bình Phước    
## # … with 688 more rows

Let’s merge this information with the polygon / census data:

vn2 %<>% 
  left_join(districts, c("province", "district")) %>% 
  select(polygon_id, province, district, n, area_km2, den_km2, lon, lat, geometry)

and this is what it looks like now:

vn2
## Simple feature collection with 698 features and 8 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 102.145 ymin: 8.381355 xmax: 109.4692 ymax: 23.39269
## CRS:            +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## First 10 features:
##    polygon_id province   district      n area_km2   den_km2      lon      lat
## 1      834464 An Giang     An Phú 183074 225.1253  813.2095 105.1009 10.84574
## 2      834463 An Giang   Châu Đốc 117244 104.0911 1126.3596 105.0873 10.67470
## 3      834466 An Giang   Châu Phú 255894 450.9987  567.3941 105.1797 10.55759
## 4      834468 An Giang Châu Thành 176945 354.9705  498.4781 105.2659 10.42124
## 5      834469 An Giang    Chợ Mới 353705 369.1722  958.1031 105.4614 10.47487
## 6      834462 An Giang Long Xuyên 303381 114.2415 2655.6104 105.4280 10.37121
## 7      834465 An Giang    Phú Tân 214823 311.6955  689.2079 105.2758 10.65475
## 8      834873 An Giang   Tân Châu 176007 172.0770 1022.8388 105.1859 10.79910
## 9      834470 An Giang  Thoại Sơn 189389 468.9065  403.8950 105.2556 10.29679
## 10     834467 An Giang  Tịnh Biên 124798 354.1079  352.4293 105.0113 10.54809
##                          geometry
## 1  MULTIPOLYGON (((105.1216 10...
## 2  MULTIPOLYGON (((105.1387 10...
## 3  MULTIPOLYGON (((105.327 10....
## 4  MULTIPOLYGON (((105.3082 10...
## 5  MULTIPOLYGON (((105.4729 10...
## 6  MULTIPOLYGON (((105.4729 10...
## 7  MULTIPOLYGON (((105.327 10....
## 8  MULTIPOLYGON (((105.2463 10...
## 9  MULTIPOLYGON (((105.2532 10...
## 10 MULTIPOLYGON (((105.1139 10...

Exploring the colocation data

Coverage: comparing facebook population with census population

The following function combines the facebook data with the GADM / census data for each week:

dist_fb_populations <- function(x) {
  x %>%
    select(polygon1_id, fb_population_1) %>% 
    distinct() %>% 
    right_join(select(st_drop_geometry(vn2), -area_km2), c("polygon1_id" = "polygon_id"))
}

Let’s compute it:

tmp <- map(colocation, dist_fb_populations)

The facebook population of each district does not seem to change as a function of time:

xs <- ymd(names(colocation)) - 6
plot(xs, seq_along(xs), ylim = c(0, 5), type = "n")

tmp %>% 
  map_dfc(pull, fb_population_1) %>% 
  t() %>% 
  as.data.frame() %>% 
  map(log10) %>% 
  walk(lines, x = xs, col = adjustcolor("black", .25))

And the distribution across districts looks like this:

tmp %>% 
  map(mutate, prop = fb_population_1 / n) %>% 
  first() %>%
  pull(prop) %>%
  hist2(50, xlab = "facebook coverage", ylab = "number of districts")

Let’s look at the facebook coverage as a function of the population size for the first week of the colocation data:

tmp <- colocation$`2020-03-03` %>%
  select(polygon1_id, fb_population_1) %>% 
  distinct() %>% 
  right_join(select(st_drop_geometry(vn2), -area_km2), c("polygon1_id" = "polygon_id"))
  
summary(lm(log10(fb_population_1) ~ log10(n), tmp))
## 
## Call:
## lm(formula = log10(fb_population_1) ~ log10(n), data = tmp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03536 -0.19070 -0.01065  0.18854  1.33225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.55716    0.18196  -25.05   <2e-16 ***
## log10(n)     1.48787    0.03592   41.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2796 on 690 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.7132, Adjusted R-squared:  0.7128 
## F-statistic:  1716 on 1 and 690 DF,  p-value: < 2.2e-16
with(tmp, plot(log10(n), log10(fb_population_1), col = "blue", axes = FALSE,
               ylim = c(1, 4.5), xlim = c(3.8, 6),
               xlab = "district population", ylab = "facebook population"))
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))

Since the slope is greater than 1, the facebook coverage (i.e. the facebook population divided by the census population) increases with population size.

Rearranging the colocation data into a matrix

Let’s generate a template with all the combinations of districts:

template <- districts %>% 
  arrange(polygon_id) %$%
  expand.grid(polygon_id, polygon_id) %>% 
  as_tibble() %>% 
  setNames(c("polygon1_id", "polygon2_id"))

The function that transforms the colocation data into a matrix:

to_matrix <- function(df, template) {
  dim_names <- sort(unique(template$polygon1_id))
  df %>% 
    select(polygon1_id, polygon2_id, link_value) %>% 
    left_join(template, ., c("polygon1_id", "polygon2_id")) %>%
    mutate(link_value = replace_na(link_value, 0)) %>% 
    pull(link_value) %>% 
    matrix(nrow(districts)) %>%
    `colnames<-`(dim_names) %>% 
    `rownames<-`(dim_names)
}

Let’s do it for all the weeks and sum them:

coloc_mat <- colocation %>% 
  map(to_matrix, template) %>% 
  reduce(`+`)

Let’s have a look at this matrix, first ordering the districts from south to north and from west to east:

hash <- setNames(seq_along(colnames(coloc_mat)),
                           colnames(coloc_mat))
ind <- districts %>% 
  arrange(lat, lon) %>% 
  pull(polygon_id) %>% 
  as.character() %>% 
  magrittr::extract(hash, .)

coloc_mat <- coloc_mat[ind, ind]

Let’s now plot the matrix:

opar <- par(pty = "s")
image(log10(apply(t(coloc_mat), 2, rev)), axes = FALSE)
par(opar)
box(bty = "o")

We can verify that the matrix is symmetric and that both Hanoi and Saigon are well connected to everywhere in the country. We can also see the districts that are not connected at all.
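These claims can also be checked directly on the matrix. A quick sketch, assuming `coloc_mat` as built above:

```r
isSymmetric(coloc_mat)                        # TRUE if the summed matrix is symmetric
sum(rowSums(coloc_mat) == 0)                  # number of districts with no links at all
rownames(coloc_mat)[rowSums(coloc_mat) == 0]  # their polygon IDs
```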

Subsetting according to coordinates

Let’s say we want to select all the provinces from the Northern EPI, the southernmost province of which is Nghệ An. Let’s retrieve the latitude of the centroids of all the provinces:

tmp <- vn1 %>% 
  st_centroid() %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  pull(Y) %>% 
  mutate(vn1, lat_cent = .) %>% 
  select(NAME_1, lat_cent)
## Warning in st_centroid.sf(.): st_centroid assumes attributes are constant over
## geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data

The province south of Nghệ An is Hà Tĩnh, the latitude of whose centroid is:

threshold <- tmp %>% 
  filter(NAME_1 == "Hà Tĩnh") %>% 
  pull(lat_cent)

Now, we retrieve the names of all the provinces that are north of this threshold:

northernEPI <- tmp %>% 
  filter(lat_cent > threshold) %>% 
  pull(NAME_1)

Now, we retrieve the corresponding districts’ ID:

sel <- districts %>% 
  filter(province %in% northernEPI) %>% 
  pull(polygon_id)

And now we can subset our matrix:

sel <- colnames(coloc_mat) %in% sel
coloc_mat_northernEPI <- coloc_mat[sel, sel]
hash <- setNames(seq_along(colnames(coloc_mat_northernEPI)),
                           colnames(coloc_mat_northernEPI))

ind <- districts %>% 
  filter(province %in% northernEPI) %>%
  arrange(lat, lon) %>% 
  pull(polygon_id) %>% 
  as.character() %>% 
  magrittr::extract(hash, .)

coloc_mat_northernEPI <- coloc_mat_northernEPI[ind, ind]

Which gives:

opar <- par(pty = "s")
image(log10(apply(t(coloc_mat_northernEPI), 2, rev)), axes = FALSE)
par(opar)
box(bty = "o")

Exploring the colocation data

colocation$`2020-03-03` %>% 
  select(polygon1_id, link_value, fb_population_1) %>% 
  group_by(polygon1_id) %>% 
  summarise(link   = sum(link_value),
            fb_pop = unique(fb_population_1)) %>%
  map(log) %$% 
  plot(fb_pop, link)
## `summarise()` ungrouping output (override with `.groups` argument)

SEIR Metapopulation Contact Model

The following model uses colocation data to formulate the coupling in a SEIR metapopulation model. The equations that describe the epidemic dynamics in each of the populations are given below: \[ \begin{aligned} \frac{dS_i}{dt} & = - S_i \sum_{k} \beta_{ik} \frac{I_k}{N_k} \\ \frac{dE_i}{dt} & = S_i \sum_{k} \beta_{ik} \frac{I_k}{N_k} - \sigma E_i \\ \frac{dI_i}{dt} & = \sigma E_i - \gamma I_i \\ \frac{dR_i}{dt} & = \gamma I_i \\ \end{aligned} \] where \(\frac{1}{\sigma} = 7\) days is the latency period and \(\frac{1}{\gamma} = 7\) days is the recovery period. Note that \(N_i = S_i + E_i + I_i + R_i\). We assume that \(\beta_{ik}\), which represents the transmission rate from population \(k\) to population \(i\), is proportional to \(C_{ik}\), the colocation probability for populations \(i\) and \(k\).
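These equations can be integrated numerically, for example with the deSolve package (not loaded above). Here is a minimal sketch for 3 hypothetical populations; the beta matrix and the initial conditions are purely illustrative stand-ins for the values derived below from the colocation matrix:

```r
library(deSolve)  # extra package, not part of the setup above

seir <- function(time, state, parms) {
  with(parms, {
    n <- nrow(beta)
    S <- state[0 * n + 1:n]
    E <- state[1 * n + 1:n]
    I <- state[2 * n + 1:n]
    R <- state[3 * n + 1:n]
    N <- S + E + I + R
    foi <- as.vector(beta %*% (I / N))  # force of infection on each population
    list(c(-S * foi,                    # dS/dt
            S * foi - sigma * E,        # dE/dt
            sigma * E - gamma * I,      # dI/dt
            gamma * I))                 # dR/dt
  })
}

parms <- list(beta  = matrix(c(.50, .05, .01,
                               .05, .40, .02,
                               .01, .02, .30), 3, 3),
              sigma = 1 / 7,
              gamma = 1 / 7)
state <- c(S = c(99999, 50000, 20000), E = rep(0, 3), I = c(1, 0, 0), R = rep(0, 3))
out   <- ode(state, times = 0:365, func = seir, parms = parms)
```

The full model would use one compartment block per district, with `beta` built from the \(\beta_{ik}\) values defined below.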

The following parameterization is an adaptation of the contact model given in the paper, Modeling the impact of human mobility and travel restrictions on the potential spread of SARS-CoV-2 in Taiwan.

The \(\beta_{ik}\) are defined as follows: \[ \begin{aligned} \beta_{ik} = \gamma R_{0 ik} \end{aligned} \] where \[ \begin{aligned} R_{0 ik} & = R_{0 \: Hanoi'} \frac{C_{ik}}{C_{Hanoi'-Hanoi'}} \\ \end{aligned} \] and \(Hanoi'\) is the district in Hà Nội with the highest density.

Let’s find the district in Hanoi with the highest density:

HanoiDist <- vn2 %>%
  filter(province == "Hà Nội") %>% 
  arrange(desc(den_km2)) %>%
  head(1) %>%
  pull(district)

HanoiDistID <- vn2 %>%
  filter(province == "Hà Nội") %>%
  filter(district == HanoiDist) %>% 
  pull(polygon_id)

As such, we will set the \(R_0\) of Hoàn Kiếm to 18; that is, \(R_{0 \: Hanoi'} = 18\).

The following function computes the \(R_{0 ik}\) values given a reference district with \(R_0\) equal to standardR0 and district ID standardDistID:

compute_R0 <- function(coloc_mat, standardR0, standardDistID) {
  melt(coloc_mat) %>% 
    setNames(c("polygon1_id", "polygon2_id", "link_value")) %>%
    mutate(R0 = standardR0 * link_value / coloc_mat[paste(standardDistID), paste(standardDistID)])
}

Let’s compute the \(R_{0 ik}\) with \(R_{0 \: Hanoi'} = 18\):

R0 <- compute_R0(coloc_mat, 18, HanoiDistID)

Intracity \(R_0\)

Let’s see the distribution of the intracity \(R_0\) values; that is, the distribution of \(R_{0 ii}\) values.

intracityR0 <- R0 %>%
  filter(polygon1_id == polygon2_id) %>%
  rename(intraR0 = R0)
hist2(intracityR0$intraR0, n = 50, xlab = "Intracity R0", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 60, 10))
axis(2)

cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
hist2(intracityR0$intraR0, quantile(intracityR0$intraR0, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = "Intracity R0", ylab = "density of probability")
axis(1, seq(0, 60, 10))
axis(2)

quantile(intracityR0$intraR0, seq(0, 1, le = 11))
##        0%       10%       20%       30%       40%       50%       60%       70% 
##  0.000000  1.586063  2.008983  2.373354  2.887077  3.456204  4.453696  6.125704 
##       80%       90%      100% 
##  8.322006 13.101921 55.396777

Let’s map the intracity \(R_0\) values for the districts:

intracityR0sf <- vn2 %>%
  right_join(select(intracityR0, -polygon2_id), c("polygon_id" = "polygon1_id"))

The districts with an intracity \(R_0\) greater than 10 are shown in red below:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intracityR0sf %>%
  select("intraR0") %>% 
  filter(intraR0 > 10) %>%
  plot(add = TRUE, col = "red")

The following districts have an intracity \(R_0\) greater than 20:

intracityR0sf %>% 
  filter(intraR0 > 20) %>%
  arrange(desc(intraR0)) %>%
  st_drop_geometry() %>%
  select(province, district, intraR0)
##       province       district  intraR0
## 1      Yên Bái       Trạm Tấu 55.39678
## 2     Hà Giang   Hoàng Su Phì 42.67776
## 3    Quảng Nam      Tây Giang 42.55601
## 4   Quảng Ngãi        Tây Trà 41.49006
## 5      Kon Tum      Kon Plông 33.77775
## 6    Thanh Hóa      Mường Lát 32.71860
## 7       Sơn La        Sốp Cộp 32.20555
## 8    Quảng Nam      Phước Sơn 31.46382
## 9     Hà Giang        Xín Mần 31.07934
## 10    Hà Giang       Đồng Văn 29.85194
## 11  Quảng Ngãi         Lý Sơn 29.61061
## 12    Hà Giang        Mèo Vạc 28.03566
## 13    Cao Bằng        Bảo Lạc 27.90237
## 14   Quảng Nam     Nam Trà My 26.41899
## 15    Lạng Sơn       Đình Lập 26.37892
## 16    Cao Bằng        Bảo Lâm 24.83981
## 17      Sơn La        Bắc Yên 24.52896
## 18  Quảng Ninh         Ba Chẽ 23.66076
## 19  Quảng Ninh      Bình Liêu 23.55896
## 20   Điện Biên       Tủa Chùa 23.15901
## 21    Cao Bằng     Thông Nông 23.08779
## 22   Điện Biên Điện Biên Đông 22.44800
## 23  Bình Thuận        Phú Quí 22.33890
## 24 Hồ Chí Minh              4 22.28844
## 25     Gia Lai        Ayun Pa 22.03421
## 26    Hà Giang        Quản Bạ 20.98567
## 27 Hồ Chí Minh              3 20.84098
## 28     Lào Cai      Si Ma Cai 20.67039
## 29   Quảng Trị      Quảng Trị 20.22414
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intracityR0sf %>% 
  filter(intraR0 > 20) %>% 
  select(intraR0) %>% 
  plot(add = TRUE, col = "red")

It appears that remote regions of Vietnam have relatively higher intracity \(R_0\) values. Let’s verify this by graphing the relationship between population density and intracity \(R_0\) values, under the assumption that more remote regions have lower population densities.

ggplot(intracityR0sf, aes(x = log10(den_km2), y = intraR0)) + geom_point()

Both populations with relatively low population densities and populations with relatively high population densities have high intracity \(R_0\) values, with the lowest-density populations having the greatest values. Under the model’s assumptions we should instead see a positive relationship between population density and intracity \(R_0\): these values need to be adjusted to correct for an artifact of how the colocation matrix is constructed.

Intercity \(R_0\)

Let’s consider the intercity R0 values:

intercityR0 <- R0 %>%
  filter(polygon1_id != polygon2_id) %>%
  rename(interR0 = R0)

totalInter <- intercityR0 %>%
  group_by(polygon1_id) %>%
  summarize(interR0 = sum(interR0)) %>%
  arrange(desc(interR0))
## `summarise()` ungrouping output (override with `.groups` argument)

Let’s map the intercity \(R_0\) values for the districts. The districts with a total intercity \(R_0\) greater than 1 are shown in red below:

intercityR0sf <- vn2 %>%
  right_join(totalInter, c("polygon_id" = "polygon1_id"))
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

intercityR0sf %>%
  select("interR0") %>% 
  filter(interR0 > 1) %>%
  plot(add = TRUE, col = "red")

Let’s graph the relationship between population density and intercity \(R_0\) values:

ggplot(intercityR0sf, aes(x = log10(den_km2), y = interR0)) + geom_point()

As we would expect, population centers with higher population densities are better connected with surrounding regions, resulting in higher intercity \(R_0\) values. Hence, the intercity \(R_0\) values under this model are consistent with epidemiological assumptions.

Intercity and Intracity \(R_0\) values

Let’s graph the relationship between intracity \(R_0\) values and intercity \(R_0\) values:

tmp <- intercityR0sf %>%
  right_join(select(intracityR0, -polygon2_id, -link_value), c("polygon_id" = "polygon1_id"))
ggplot(tmp, aes(x = intraR0, y = interR0)) + geom_point() +
  geom_abline(slope = 1, intercept = 0)

There are two distinct groups in the plot: one group has larger intercity \(R_0\) values relative to intracity \(R_0\) values, while the other has larger intracity \(R_0\) values relative to intercity \(R_0\) values.

Let’s use a linear mixture model to assign each point to a particular linear model:

library(flexmix)
## Loading required package: lattice
model <- flexmix(interR0 ~ intraR0, tmp, k = 2)
summary(model)
## 
## Call:
## flexmix(formula = interR0 ~ intraR0, data = tmp, k = 2)
## 
##        prior size post>0 ratio
## Comp.1 0.219  124    666 0.186
## Comp.2 0.781  574    643 0.893
## 
## 'log Lik.' -498.521 (df=7)
## AIC: 1011.042   BIC: 1042.88
plot(tmp$interR0, tmp$intraR0, col = clusters(model), 
     xlab = "intercity R0", ylab = "intracity R0")

Let’s map the clustering observed in the linear mixture model to the districts:

vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

colors <- clusters(model)
colors[colors == 1] <- 3

tmp %>% 
  transmute(cluster = clusters(model)) %>%
  st_geometry() %>%
  plot(add = TRUE, col = colors)

It is a little noisy.

Gravity model

One model commonly used in geography is the so-called gravity model, which states that the connection between any two locations \(i\) and \(j\) is:

\[ C_{ij} = \theta\frac{P_i^{\tau_1}P_j^{\tau_2}}{d_{ij}^\rho} \] See for example 10.1126/science.1125237.

TO DO: test this model on the facebook data.
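A possible starting point is a log-linear fit of the off-diagonal colocation values. This is only a sketch: it assumes `coloc_mat` and `districts` as built above, uses a crude Euclidean distance in degrees between centroids (a great-circle distance, e.g. from `st_distance()`, would be better), and the name `grav` is illustrative:

```r
# log C_ij = log(theta) + tau1 log P_i + tau2 log P_j - rho log d_ij
pops <- setNames(districts$n,   districts$polygon_id)
lons <- setNames(districts$lon, districts$polygon_id)
lats <- setNames(districts$lat, districts$polygon_id)

grav <- as.data.frame(as.table(coloc_mat), stringsAsFactors = FALSE) %>% 
  setNames(c("i", "j", "C")) %>% 
  filter(i != j, C > 0) %>% 
  mutate(Pi = pops[i],
         Pj = pops[j],
         d  = sqrt((lons[i] - lons[j])^2 + (lats[i] - lats[j])^2))

summary(lm(log(C) ~ log(Pi) + log(Pj) + log(d), grav))
```

The estimated coefficients on `log(Pi)`, `log(Pj)` and `log(d)` would correspond to \(\tau_1\), \(\tau_2\) and \(-\rho\) respectively.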

Effective distance

See 10.1126/science.1245200
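Following that paper (Brockmann & Helbing 2013), the effective distance from node \(j\) to node \(i\) is \(d_{ij} = 1 - \log P_{ij}\), where \(P_{ij}\) is the fraction of the flux leaving \(j\) that goes to \(i\). A minimal sketch, under the assumption that `coloc_mat` can stand in for the flux matrix:

```r
P <- sweep(coloc_mat, 2, colSums(coloc_mat), "/")  # column-normalised: P[i, j] = share of j's flux going to i
eff_dist <- 1 - log(P)  # unconnected pairs get Inf, i.e. infinitely far
```

The effective distance between two arbitrary nodes would then be the shortest path through these quantities, which could be computed with e.g. `igraph::distances()` on a weighted directed graph.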